5  Part Two - Processing Raster Data with Apache Sedona and Sparklyr in R

Using Raster Tiles to Determine Local Climate Zones of Pickup and Dropoff Locations

5.1 Introduction

For this chapter, we would like to find out the land cover classification associated with specific pickup and dropoff points. Our assumption is that the type of area where one picks up a taxi or gets dropped off may influence how long the trip takes. Again, do not dwell too much on this assumption — the main objective is to demonstrate another way of extracting data from raster files.

We shall make use of the Local Climate Zones (LCZ) Map from the World Urban Database and Access Portal Tools (WUDAPT); the US version of this dataset is available from the WUDAPT portal. The dataset contains 17 land cover classifications, ranging from compact high-rise buildings to water bodies.

We downloaded a version that was already in EPSG:4326 CRS and clipped it based on the NYC boundary. For this chapter, we demonstrate how to divide a raster image into tiles before performing analysis. This is particularly useful when working with large raster files that cannot be processed in their entirety. By dividing the raster into tiles, we make the analysis more manageable and efficient.

Note: Raster processing is computationally intensive, so it is preferable to work with smaller files, especially when operating in local mode. A telltale sign that a raster file is too large is an error (typically an out-of-memory error) when you attempt to view its contents.

Important: The Spark configuration used in this chapter is the same as that used in Chapter 2, and so it is not repeated here for brevity.

5.2 Loading the data

We start by loading the most recently updated locations data.

# Read location data stored in Delta format
locations_sdf_updated_two <- spark_read_delta(
  sc,
  path = file.path(getwd(), "data", "locations_sdf_updated_two")
) %>% 
  filter(trip_id > 40000000) %>%  # Optional filtering
  sdf_repartition(partitions = 24) %>%  # Repartition for parallelism
  sdf_register("locations_sdf_updated_two_view")  # Register as temporary view for SQL
# Check partition sizes to understand data distribution
locations_sdf_updated_two %>% 
  sdf_partition_sizes()

We then load our raster data.

# Define path to raster file (LCZ map)
wudapt_raster_filepath <- file.path(
  getwd(),
  "data",
  "raster",
  "wudapt",
  "CONUS_LCZ_map_NLCD_v1.0_cropped.tif"
)
# Read raster as binary using Spark
wudapt_binary <- spark_read_binary(
  sc,
  dir = wudapt_raster_filepath,
  name = "wudapt_binary_view"
)

5.3 Creating raster tiles

Here, we are assuming that our raster data is too large to be processed in its entirety. To address this, we divide it into tiles of 256 by 256 pixels. By doing so, we make it more analysis-friendly, as we can first identify a specific tile to work on and then perform analysis on only that tile, instead of the entire raster.

Note: The clipped dataset used here was actually not large, but we proceeded with this method for demonstration purposes to show how one would handle genuinely large raster datasets.
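To get a feel for what tiling produces, you can estimate the number of tiles from the raster's pixel dimensions. The dimensions below are hypothetical, chosen purely for illustration; substitute whatever your actual GeoTIFF reports:

```r
# Sketch: estimate how many 256 x 256 tiles a raster yields.
# The width and height here are assumed values, not those of the LCZ clip.
tile_size     <- 256
raster_width  <- 1500  # pixels (assumed)
raster_height <- 1200  # pixels (assumed)

# Edge tiles may be smaller than 256 x 256, hence the ceiling
n_tiles <- ceiling(raster_width / tile_size) * ceiling(raster_height / tile_size)
n_tiles  # 6 columns * 5 rows = 30 tiles
```

Each tile then carries its (x, y) grid position, which is exactly what `RS_TileExplode` returns alongside the tile itself.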

# Explode raster into tiles (256x256) for spatial operations and cache in memory
wudapt_raster_tiles <- sdf_sql(
  sc,
  "
  SELECT RS_TileExplode(RS_FromGeoTiff(content), 256, 256) AS (x, y, tile)
  FROM wudapt_binary_view
  "
) %>%
  sdf_register("wudapt_raster_tiles_view")

# Quick look at raster tiles structure
wudapt_raster_tiles %>% glimpse()

As you can see below, the spatial SQL query is divided into two parts:

  1. First, we locate the tile that a given coordinate pair (pickup or dropoff point) belongs to.
  2. Second, we use that specific tile to find the actual land classification value associated with that location.

This two-step approach makes it efficient to work with large raster datasets by limiting the analysis to only relevant tiles, rather than processing the entire raster at once.

# Spatial join: match location points with corresponding LCZ tiles and assign LCZ labels
locations_sdf_updated_three <- sdf_sql(
  sc,
  "
  WITH matched_tiles AS (
    SELECT l.*, w.tile
    FROM wudapt_raster_tiles_view w
    JOIN locations_sdf_updated_two_view l
      ON RS_Intersects(w.tile, ST_Point(l.longitude, l.latitude))
  ),

  classified AS (
    -- Evaluate RS_Value once per row instead of once per CASE branch
    SELECT *,
      TRY_CAST(RS_Value(tile, ST_Point(longitude, latitude)) AS INT) AS lcz_class
    FROM matched_tiles
  )

  SELECT *,
    CASE lcz_class
      WHEN 1 THEN 'Compact highrise'
      WHEN 2 THEN 'Compact midrise'
      WHEN 3 THEN 'Compact lowrise'
      WHEN 4 THEN 'Open highrise'
      WHEN 5 THEN 'Open midrise'
      WHEN 6 THEN 'Open lowrise'
      WHEN 7 THEN 'Lightweight lowrise'
      WHEN 8 THEN 'Large lowrise'
      WHEN 9 THEN 'Sparsely built'
      WHEN 10 THEN 'Heavy industry'
      WHEN 11 THEN 'Dense trees'
      WHEN 12 THEN 'Scattered trees'
      WHEN 13 THEN 'Bush, scrub'
      WHEN 14 THEN 'Low plants'
      WHEN 15 THEN 'Bare rock or paved'
      WHEN 16 THEN 'Bare soil or sand'
      WHEN 17 THEN 'Water'
      ELSE 'Unknown'
    END AS lcz_label
  FROM classified
  "
)
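If you prefer to keep the labelling on the R side, the same code-to-label mapping can be expressed as a named vector and applied to class values after collecting them from Spark. This is a sketch on made-up class values, not part of the pipeline above:

```r
# LCZ code-to-label mapping (same scheme as the SQL CASE expression)
lcz_labels <- c(
  "1"  = "Compact highrise",    "2"  = "Compact midrise",
  "3"  = "Compact lowrise",     "4"  = "Open highrise",
  "5"  = "Open midrise",        "6"  = "Open lowrise",
  "7"  = "Lightweight lowrise", "8"  = "Large lowrise",
  "9"  = "Sparsely built",      "10" = "Heavy industry",
  "11" = "Dense trees",         "12" = "Scattered trees",
  "13" = "Bush, scrub",         "14" = "Low plants",
  "15" = "Bare rock or paved",  "16" = "Bare soil or sand",
  "17" = "Water"
)

# Hypothetical class values, as if collected from the Spark DataFrame
example_classes <- c(2L, 11L, 17L, 99L)

labels <- unname(lcz_labels[as.character(example_classes)])
labels[is.na(labels)] <- "Unknown"  # mirrors the ELSE 'Unknown' branch
labels
# "Compact midrise" "Dense trees" "Water" "Unknown"
```

Doing the lookup in SQL, as above, avoids collecting data to the driver, which matters at tens of millions of rows; the R version is handy for small summaries or plots.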

We now remove the tile geometry from our dataset because Delta cannot serialise geometry columns, and because we do not need the geometry for our intended analysis. This keeps the dataset clean and light for storage and further processing.

# Drop RasterUDT (tile) column before saving as Delta (unsupported type in Delta)
locations_sdf_updated_three <- locations_sdf_updated_three %>%
  select(-tile)

The updated data now looks as shown below. As you can see, we went from only having coordinates to obtaining richer information about our locations by leveraging additional geospatial datasets.

Not too bad for using a simple laptop to process tens of millions of rows!

# Preview updated data
locations_sdf_updated_three %>% glimpse()
# Print formatted preview for review
withr::with_options(
  list(pillar.sigfig = 8),
  print(locations_sdf_updated_three, n = 10, width = Inf)
)

Finally, we save the data for further processing.

# Define output path for writing final dataset
locations_sdf_updated_three_file_path <- file.path(
  getwd(),
  "data",
  "locations_sdf_updated_three"
)

# Write enriched data back to Delta format (append mode)
spark_write_delta(
  locations_sdf_updated_three,
  path = locations_sdf_updated_three_file_path,
  mode = "append"
)

Finally, we disconnect from our Spark instance.

# Disconnect Spark session to free resources
spark_disconnect(sc)